
Introduction

This document describes the data created for APRA’s fundraising data science online learning courses and workshops. All of the data created for these purposes is fictitious.

There are three data sets available as of 2020-08-12:

  • Biographical (donor level)
  • Giving (gift level)
  • Engagement (donor level)

Each of these tables, and the variables it contains, is described below. Tabs throughout the document can be used to explore the variables in each data set.

These data sets are designed to mirror realistic fundraising data and are not intended to be perfectly “clean” data. Common fundraising data challenges are built into the data files. To learn more about a data set, click its tab above; for example, the Biographical Data tab.

All of the code for this project is available on GitHub. The code that generates the data sets can be found in the generate_data.R script.

The individual data sets can be read into R directly from GitHub as follows.

# load the tidyverse (data import and wrangling) and knitr (for kable) libraries
library(tidyverse)
library(knitr)

# read bio data csv into R and store in a data frame named bio
bio <- read_csv("https://raw.githubusercontent.com/majerus/apra_data_science_courses/master/bio_data_table.csv")

# print selected columns for 10 randomly sampled donors as a table
bio %>% 
  sample_n(10) %>% 
  select(id, name, birthday, city, state, capacity, capacity_source) %>%
  kable()
| id      | name             | birthday   | city           | state | capacity      | capacity_source |
|---------|------------------|------------|----------------|-------|---------------|-----------------|
| 6978639 | Tinner, Stephanie | 1951-05-30 | Jefferson city | MO    | $1k - $2.5k   | institutional   |
| 8941505 | el-Akram, Nawaar | 1956-05-28 | Long beach     | CA    | $75k - $100k  | screening       |
| 4242470 | Barela, Elizabeth | 1979-07-24 | Mckinney       | TX    | $500k - $750k | NA              |
| 4258011 | Walsh, Dante     | 1970-01-05 | Gibsonville    | NC    | $10k - $25k   | screening       |
| 6843645 | Dhindsa, Daniel  | 1963-03-16 | Greeley        | CO    | $25k - $50k   | screening       |
| 3717480 | Geisert, Jon     | 1969-07-29 | Waxahachie     | TX    | $50k - $75K   | screening       |
| 7456756 | Jain, Summer     | 1999-01-25 | Crestwood      | KY    | $25k - $50k   | screening       |
| 8827814 | Bates, Ryan      | 1951-09-25 | Greeley        | CO    | NA            | institutional   |
| 8891271 | Mills, Carly     | 1923-03-06 | Cincinnati     | OH    | $100k - $250k | screening       |
| 6234290 | Bagaporo, Hai    | 1962-03-15 | Eastpointe     | MI    | $50k - $75K   | NA              |

Biographical Data

The biographical data has 14 variables and 100,000 observations. The data is stored at the donor level. Each row of the data represents a unique donor and biographical information about that donor.

Numeric Variables

There are 4 numeric variables:

  • id: A seven-digit numeric id that is unique to each donor.
  • household_id: A seven-digit numeric id that is unique to each household. More than one donor may share a household_id.
  • lat: The latitude of the center point of each donor’s zip code. Missing for donors residing outside the United States.
  • lon: The longitude of the center point of each donor’s zip code. Missing for donors residing outside the United States.
## Rows: 100,000
## Columns: 4
## $ id           <dbl> 8275707, 2963581, 4302254, 7637444, 9369155, 1026439, 65…
## $ household_id <dbl> 1000235, 1000235, 1000303, 1000341, 1000341, 1000435, 10…
## $ lat          <dbl> 34.03, 41.29, NA, 36.07, 26.23, 33.60, 40.99, 38.82, 32.…
## $ lon          <dbl> -117.75, -92.63, NA, -94.15, -80.13, -117.71, -74.34, -7…
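
Because more than one donor may share a household_id, it is often useful to count donors per household, and because lat and lon are missing for donors outside the United States, those rows can be isolated with a simple filter. The following is a minimal sketch, not the project's own code: `bio_sample` is a hypothetical stand-in for `bio` built from the first five rows of the output above, and the names `household_sizes` and `missing_coords` are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for `bio`: the first five rows of the numeric
# columns, copied from the glimpse() output above
bio_sample <- tibble(
  id           = c(8275707, 2963581, 4302254, 7637444, 9369155),
  household_id = c(1000235, 1000235, 1000303, 1000341, 1000341),
  lat          = c(34.03, 41.29, NA, 36.07, 26.23),
  lon          = c(-117.75, -92.63, NA, -94.15, -80.13)
)

# Count the number of donors that share each household_id
household_sizes <- bio_sample %>%
  count(household_id, name = "n_donors")

# Keep only donors with missing coordinates
# (described above as donors residing outside the United States)
missing_coords <- bio_sample %>%
  filter(is.na(lat) | is.na(lon))

household_sizes
missing_coords
```

Replacing `bio_sample` with the full `bio` data frame should apply the same steps to all 100,000 donors.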

Character Variables

When the data is loaded with the default settings, there are 9 character variables:

  • name: Each donor’s first and last name formatted as “last name, first name”.
  • country: Each donor’s country of residence.
  • city: Each donor’s city of residence.
  • deceased: A binary indicator of whether a donor is deceased (“Y”|“N”).
  • zip: The five-digit zip code of donors whose country of residence is the United States.
  • state: The two-letter state abbreviation for each donor whose country of residence is the United States.
  • capacity: Each donor’s capacity represented within an estimated range.
  • capacity_source: A categorical variable indicating how the capacity was determined (“institutional”|“screening”).
  • race: A categorical variable indicating the donor’s race.
## Rows: 100,000
## Columns: 9
## $ name            <chr> "al-Shakoor, Labeeb", "Nero, Brianna", "al-Rasheed, R…
## $ country         <chr> "United States", "United States", "China", "United St…
## $ city            <chr> "Pomona", "Oskaloosa", "Shenzhen", "Fayetteville", "P…
## $ deceased        <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ zip             <chr> "91766", "52577", NA, "72701", "33069", "92653", "074…
## $ state           <chr> "CA", "IA", NA, "AR", "FL", "CA", "NJ", "VA", "TX", N…
## $ capacity        <chr> "$50k - $75K", "$1k - $2.5k", "$5k - $10k", "$2.5k - …
## $ capacity_source <chr> "screening", "screening", "screening", "institutional…
## $ race            <chr> "Non-Hispanic white", "Non-Hispanic white", "Asian", …
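
The capacity values shown above mix upper- and lower-case suffixes (for example “$50k - $75K” alongside “$1k - $2.5k”), and deceased is stored as “Y”/“N” text. One possible cleanup is sketched below, assuming that a trailing upper-case “K” is the only inconsistency in capacity; this has not been checked against the full data. `bio_sample` is a hypothetical stand-in for `bio`: its capacity values are copied from the outputs above and its deceased values are illustrative.

```r
library(dplyr)

# Hypothetical stand-in for `bio`; capacity values are copied from the
# outputs above, deceased values are illustrative
bio_sample <- tibble(
  deceased = c("Y", "N", "Y", "N"),
  capacity = c("$50k - $75K", "$1k - $2.5k", "$5k - $10k", NA)
)

bio_clean <- bio_sample %>%
  mutate(
    # Replace a trailing upper-case "K" with "k"; sub() leaves NA as NA
    capacity = sub("K$", "k", capacity),
    # Convert the "Y"/"N" indicator to a logical column
    deceased = deceased == "Y"
  )

bio_clean
```

Converting deceased to a logical column makes it easier to filter and summarise, for example with `filter(!deceased)`.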

[Interactive tabs: country, deceased, state, capacity, capacity_source, race]

Date Variables

There is 1 date variable:

  • birthday: The date of each donor’s birth stored as a date variable.
## Rows: 100,000
## Columns: 1
## $ birthday <date> 1923-11-18, 1925-03-18, 1924-08-28, 1923-05-14, 1921-10-11,…
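
Because birthday is stored as a date, a donor’s age can be derived from it. The sketch below is one way to do this in base R date arithmetic inside `mutate()`. It assumes a reference date of 2020-08-12 (the date given in the introduction); `bio_sample`, `as_of`, and `age` are illustrative names, and the three birthdays are copied from the output above.

```r
library(dplyr)

# Hypothetical stand-in for `bio`: the first three birthdays from the
# glimpse() output above
bio_sample <- tibble(
  birthday = as.Date(c("1923-11-18", "1925-03-18", "1924-08-28"))
)

# Assumed reference date for the age calculation
as_of <- as.Date("2020-08-12")

bio_sample <- bio_sample %>%
  mutate(
    # Difference in calendar years, minus one if the birthday has not
    # yet occurred in the reference year
    age = as.integer(format(as_of, "%Y")) -
      as.integer(format(birthday, "%Y")) -
      (format(as_of, "%m%d") < format(birthday, "%m%d"))
  )

bio_sample
```

Comparing the month and day as zero-padded text (“0812” versus “1118”) avoids dividing a day count by 365.25, which can be off by one near birthdays.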

Giving Data

The giving data has 6 variables and 540,000 observations. The data is stored at the gift level. Each row of the data represents a unique gift and attributes associated with that gift.

Numeric Variables

There are 4 numeric variables:

  • id: A seven-digit numeric id that is unique to each donor, but can repeat in the giving data for donors with more than one gift.
  • household_id: A seven-digit numeric id that is unique to each household.
  • gift_id: A seven-digit numeric id that is unique to each gift.
  • gift_amt: The total gift amount received (i.e., total amount of cash received in USD).
## Rows: 540,000
## Columns: 4
## $ household_id <dbl> 2231010, 9276150, 4132585, 6308003, 1235119, 8185048, 11…
## $ id           <dbl> 2004705, 3496504, 1679611, 9229575, 3229105, 9841718, 57…
## $ gift_id      <dbl> 1000064, 1000392, 1000612, 1000726, 1000853, 1000937, 10…
## $ gift_amt     <dbl> 24474, 590, 530, 222, 691, 431, 250, 984, 12103, 184750,…
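
Since id can repeat in the giving data, gift-level rows usually need to be summarised to the donor level before they are compared with the biographical data. The following is a minimal sketch of that step. `giving_sample` is hypothetical: its gift_id and gift_amt values come from the output above, but one id is deliberately repeated for illustration, and the summary column names are not taken from the project.

```r
library(dplyr)

# Hypothetical gift-level rows; gift_id and gift_amt are copied from the
# output above, and the first id is repeated on purpose to show a donor
# with more than one gift
giving_sample <- tibble(
  id       = c(2004705, 2004705, 3496504),
  gift_id  = c(1000064, 1000392, 1000612),
  gift_amt = c(24474, 590, 530)
)

# Collapse gift-level rows to one row per donor
donor_giving <- giving_sample %>%
  group_by(id) %>%
  summarise(
    n_gifts       = n(),           # number of gifts per donor
    total_giving  = sum(gift_amt), # total cash received in USD
    largest_gift  = max(gift_amt),
    .groups = "drop"
  )

donor_giving
```

The resulting donor-level table has one row per id, so it could then be combined with the biographical data using, for example, `left_join(bio, donor_giving, by = "id")`.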

Character Variables

When the data is loaded with the default settings, there is 1 character variable:

  • credit_type: A categorical variable that indicates if a gift is counted as hard-credit or soft-credit.

Date Variables

There is 1 date variable:

  • gift_date: The date that each gift was received.

Engagement Data